Space-time behavior based correlation

ABSTRACT

A method includes measuring the likelihood that two different space-time video segments could have resulted from a similar underlying motion field without computing the field. The method may be employed to identify locations in a video sequence where at least one behavioral phenomenon similar to that demonstrated in the video segment occurs. For example, the phenomenon might be a dynamic behavior, an action, a rigid motion, and/or a non-rigid motion.

FIELD OF THE INVENTION

The present invention relates to action recognition in video sequences.

BACKGROUND OF THE INVENTION

Different people with similar behaviors induce completely different space-time intensity patterns in a recorded video sequence. This is because they wear different clothes and their surrounding backgrounds are different. What is common across such sequences of the same behavior is the underlying induced motion fields. Efros et al. (in A. A. Efros, A. C. Berg, G. Mori and J. Malik. Recognizing action at a distance. ICCV, October 2003) employed this observation by using low-pass filtered optical-flow fields (between pairs of frames) for action recognition.

However, dense unconstrained and non-rigid motion estimation is highly noisy and unreliable. Clothes worn by different people performing the same action often have very different spatial properties (different color, texture, etc.). Uniform-colored clothes induce local aperture effects, especially when the observed acting person is large (which is why Efros et al. analyzed small people, “at a glance”). Dense flow estimation is even more unreliable when the dynamic event contains unstructured objects, like running water, flickering fire, etc.

Prior art methods for action recognition in video sequences are limited in a variety of ways. The methods proposed by Bobick et al. (A. Bobick and J. Davis. The recognition of human movement using temporal templates. PAMI, 23(3):257-267, 2001) and Sullivan et al. (J. Sullivan and S. Carlsson. Recognizing and tracking human action. In ECCV, 2002) require prior foreground/background segmentation. The methods proposed by Yacoob et al. (Y. Yacoob and M. J. Black. Parameterized modeling and recognition of activities. CVIU, 73(2):232-247, 1999), Black (M. J. Black. Explaining optical flow events with parameterized spatio-temporal models, in CVPR, 1999), Bregler (C. Bregler. Learning and recognizing human dynamics in video sequences. CVPR, June 1997), Chomat et al. (O. Chomat and J. L. Crowley. Probabilistic sensor for the perception of activities. ECCV, 2000), and Bobick et al. require prior modeling or learning of activities, and are therefore restricted to a small set of predefined activities. The methods proposed by Efros et al., Yacoob et al., and Black require explicit motion estimation or tracking, which entail the fundamental hurdles of optical flow estimation (aperture problems, singularities, etc.).

SUMMARY OF THE INVENTION

It is an object of the present invention to provide a novel method for handling behavioral phenomena in a video sequence.

There is therefore provided, in accordance with a preferred embodiment of the present invention, a method including measuring the likelihood that two different space-time video segments could have resulted from a similar underlying motion field without computing the field.

Moreover, in accordance with a preferred embodiment of the present invention, the measuring utilizes pixel values directly. The pixel values are at least one of the following: pixel intensities, pixel colors, filtered intensities, local SSD (sum of square differences) surfaces, correlation surfaces and normalized correlation surfaces.

Further, in accordance with a preferred embodiment of the present invention, the measuring includes determining local motion consistency between corresponding, relatively small, space-time patches of the video segments. The corresponding space-time patches may be locally shifted relative to each other.

Still further, in accordance with a preferred embodiment of the present invention, the measuring includes comparing the space-time patches to a set of representative patches to generate feature vectors and determining motion consistency as a function of the distance between the feature vectors of the corresponding space-time patches.

In accordance with an alternative preferred embodiment of the present invention, the measuring includes comparing the space-time patches to a set of representative patches to generate feature vectors and determining motion consistency as a function of the distance between distributions of the feature vectors extracted from space-time corresponding regions within the video segments.

Moreover, in accordance with a preferred embodiment of the present invention, the determining includes calculating whether a vector u with a non-zero temporal component exists which is perpendicular to space-time gradients of pixel values in the corresponding space-time patches.

Additionally, in accordance with a preferred embodiment of the present invention, the determining includes calculating a rank increase between a 2×2 upper-left minor matrix of a 3×3 gram matrix of the space-time gradients and the 3×3 gram matrix. The rank increase measure may be a continuous rank increase measure and/or it may be approximate.

Further, in accordance with a preferred embodiment of the present invention, the method may include computing local consistency scores as a function of the local motion consistency for a multiplicity of the space-time patches within the video segments, aggregating the local consistency scores into a global correlation score between the video segment and the video sequence, determining a correlation volume of the video sequence with respect to the video segment and identifying peaks in the correlation volume, the peaks denoting locations in the video sequence where behavioral phenomena occur.

Still further, in accordance with a preferred embodiment of the present invention, the determining includes generating multiple resolutions of the video sequence and the video segment, wherein multiple resolutions are in at least one of space and time, searching a coarse resolution of the video segment within a coarse resolution of the video sequence to find locations with high match values, refining the search in a finer resolution around areas in the finer resolution corresponding to the locations in the coarse resolution and repeating the refined search until reaching a desired resolution level.

Additionally, in accordance with a preferred embodiment of the present invention, the measuring is employed for identifying locations in a video sequence where at least one behavioral phenomenon similar to that demonstrated in a video segment occurs. The at least one phenomenon may be a dynamic behavior, an action, a rigid motion, and/or a non-rigid motion. It may be a multiplicity of phenomena occurring within a field-of-view of a camera.

Moreover, in accordance with a preferred embodiment of the present invention, an entity performing the at least one phenomenon in the video segment does not have the same appearance as an entity performing a phenomenon similar to the at least one phenomenon in the video sequence. The entity may be a person, an animal, a rigid object, and/or a non-rigid object.

Further, in accordance with a preferred embodiment of the present invention, the measuring is employed for at least one of the following: video search, indexing and fast forward to a next phenomenon of the behavioral phenomenon.

Alternatively, in accordance with a preferred embodiment of the present invention, the measuring is employed for automatic video sequence registration or alignment, or for action based identification of an entity.

In accordance with a preferred embodiment of the present invention, the identifying is employed for spatio-temporal clustering or for video segmentation into sequences of similar phenomena.

There is also provided, in accordance with a preferred embodiment of the present invention, a method including constructing a correlation volume between a video segment and a video sequence.

There is also provided, in accordance with a preferred embodiment of the present invention, a method including measuring a rank-increase between a 2×2 upper-left minor matrix of a 3×3 gram matrix and the 3×3 gram matrix as a function of the discrepancy between the eigenvalues of the gram matrix and the minor matrix.

Moreover, in accordance with a preferred embodiment of the present invention, the function includes the ratio between the product of the three eigenvalues of the gram matrix and the two eigenvalues of the minor matrix.

Further, in accordance with a preferred embodiment of the present invention, the function includes the difference between the sum of the three eigenvalues of the gram matrix and the two eigenvalues of the minor matrix.

Still further, in accordance with a preferred embodiment of the present invention, the function includes calculating the following:

${\Delta \; r} = \frac{\lambda_{2} \cdot \lambda_{3}}{\lambda_{1}^{♦} \cdot \lambda_{2}^{♦}}$

where the eigenvalues of the gram matrix are λ₁≧λ₂≧λ₃, and the eigenvalues of the minor matrix are λ₁^⋄≧λ₂^⋄.

BRIEF DESCRIPTION OF THE DRAWINGS

The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:

FIG. 1 is an illustrative example of two-dimensional image correlation;

FIG. 2 is a schematic illustration of three-dimensional image correlation of an exemplary video template against an exemplary video sequence in accordance with a preferred embodiment of the present invention;

FIG. 3 is a flow chart illustration of a method of identifying locations in a video sequence where a motion demonstrated in a video segment occurs, constructed and operative in accordance with the present invention;

FIG. 4 is a schematic illustration showing how consistency between the space-time (ST)-patches of FIG. 2 may be measured, in accordance with a preferred embodiment of the present invention;

FIGS. 5A, 5B and 5C are flow chart illustrations of the derivation of the calculation used for measuring consistency in the method of FIG. 3;

FIG. 6 is an illustration of the employment of the present invention to detect instances of people walking in a video recorded on a beach;

FIG. 7 is an illustration of the employment of the present invention to detect instances of a dancer performing a particular turn in a video of a ballet; and

FIG. 8 is an illustration of the employment of the present invention to detect, in one video, instances of five different activities, each demonstrated in a separate video template.

It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.

DETAILED DESCRIPTION OF THE INVENTION

In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention.

Applicants have realized that the constraints to action recognition in video presented by the methods of the prior art may be circumvented by a novel approach to action recognition in video.

Applicants have realized that 2-dimensional image correlation, which is well known in the art, may be extended into a 3-dimensional space-time video-template correlation. Applicants have further realized that the novel 3-dimensional space-time video-template correlation may be employed to recognize similar actions in video segments.

In 2-dimensional image correlation, a template of the desired image (the correlation kernel) is compared with the actual camera image of an object and a new image (the correlation image) is generated. The peak correlation values in the correlation image indicate where the template matches the camera image.

Reference is now made to FIG. 1 which shows an example of template correlation. In the example shown in FIG. 1, a template 10 of the desired image is an eye of a monkey shown in a camera image 15. A comparison of template 10 with camera image 15 generates the correlation image 20 (overlaid on top of the camera image), which indicates where the template matches the camera image. As shown in FIG. 1, the particularly bright patches in correlation image 20 indicate the positions of the eyes of the monkey, where camera image 15 matches template 10.

When image correlation is conducted in three dimensions in accordance with a preferred embodiment of the present invention, templates comprising small space-time video segments (small video clips) may be “correlated” against entire video sequences in all three dimensions (x, y, and t). Peak correlation values may correspond to segments of the video sequences in which dynamic behaviors similar to those in the template occur.

Applicants have realized that action recognition via 3-dimensional space-time video template correlation may detect very complex behaviors in video sequences (e.g., ballet movements, pool dives, running water), even when multiple complex activities occur simultaneously within the field-of-view of the camera.

Applicants have further realized that the similarity between two different space-time intensity patterns of two different video segments may be measured directly from the intensity information, without explicitly computing the underlying motions. This type of measurement may allow for the detection of similarity between video segments of differently dressed people performing the same type of activity, without requiring foreground/background segmentation, prior learning of activities, motion estimation or tracking.

The present invention may identify locations in a video sequence where a motion demonstrated in a video segment occurs. The present invention may also be a behavior-based similarity measure which may indicate the likelihood that two different space-time intensity patterns of two different video segments result from a similar underlying motion field, without explicitly computing that motion field.

Reference is now made to FIG. 2 which shows the video components which, in accordance with a preferred embodiment of the present invention, may be considered in order to correlate templates of small space-time video segments (small video clips) against entire video sequences in all three dimensions (x, y, and t). As shown in FIG. 2, T may be a small space-time template which is correlated against a larger video sequence V. T may be a very small video clip, e.g., 30×30 pixels by 30 frames, while V may be a video of a standard format size (e.g. 720×576 in PAL and 640×480 in NTSC) and of a standard video length (e.g. several minutes or hours). It will be appreciated that the sizes here are exemplary only and that the present invention is operable for many different sized small video clips T and videos V.

The correlation of T against V may generate a space-time “behavioral correlation volume” C(x, y, t), which may be analogous to the 2-dimensional correlation surface 20 introduced in FIG. 1. Peaks within correlation volume C may be occurrences of behavior in video sequence V similar to the behavior in template T.

In accordance with a preferred embodiment of the present invention, each value in correlation volume C may be computed by measuring the degree of behavioral similarity between two video segments: the space-time template T, and a video segment S⊂V, where S (FIG. 2) may be of the same dimensions as T and may be centered around the point (x, y, t)∈V. The behavioral similarity between video segments T and S may be evaluated by computing and integrating local consistency measures between small space-time patches P within these video segments. Each space-time patch (ST-patch) P may be 7×7 pixels by 3 frames, or any other small volume. As shown in FIG. 2, for each point (x, y, t)∈S, an ST-patch P_S⊂S centered around (x, y, t) may be compared, in a suitable manner, against its corresponding ST-patch P_T⊂T at the matching location. Lines 31 indicate exemplary corresponding ST-patches P_T and P_S in T and S respectively.

These local scores may then be aggregated to provide a global correlation score for the entire template T at this video location, in a similar manner to the way correlation of image templates may be performed. However, here the small patches P also have a temporal dimension, and thus the similarity measure between patches may capture similarity of the implicit underlying motions.

Reference is now made to FIG. 3, which is a flow chart illustration summarizing the method steps developed by Applicants and described hereinabove, which method may be tangibly embodied in a program of instructions executable by a machine. As shown in FIG. 3, the method begins by measuring (step 22) the consistency between ST-patches P_T and P_S, after which, the method may aggregate (step 24) the local consistency scores into a more global behavior-based correlation score between video template T and video segment S. The method may continue by constructing (step 26) some type of correlation volume C of video V with respect to video template T. Finally, the method may identify (step 28) the peaks in correlation volume C as locations in video V where behavioral phenomena similar to those seen in template T occur.

When measuring (step 22) the consistency, each ST-patch P_S may be compared to the associated patch P_T or to a small neighborhood around the matching location in video template T (one patch against many). The latter embodiment may account for non-rigid deformations of the behavior. When comparing one patch to multiple patches, the method may select the best match among all the matches or a weighted average of them. Another possibility is to compare a small neighborhood in S to the matching, small, neighborhood in T (several patches against several patches). The comparison in this case may be of the local statistics of the behaviors and, as is discussed in detail hereinbelow, may involve comparing histograms or distributions of feature vectors extracted using the similarity measure.

FIG. 4, reference to which is now made, shows how consistency between ST-patches P_T and P_S may be measured (step 22) in accordance with a preferred embodiment of the present invention. In FIG. 4, objects O₁, O₂ and O₃ are shown to move through space over time in video V. Each object O induces a “brush” of intensity curves I in the space-time volume. Applicants have realized that curves I may be complex, but that locally (within a small space-time patch), curves I may be assumed to be straight lines 33 as shown in detail 34 of a space-time patch P.

Applicants have realized that this assumption may be true for most ST-patches in real video sequences. A very small number of patches in the video sequence may violate this assumption. The latter may be patches located at motion discontinuities, as well as patches that contain an abrupt temporal change in the motion direction or velocity.

For most patches, the intensity lines 33 within a single ST-patch may thus be oriented in a single space-time direction ū=[u, v, w] as shown in detail 34. The direction ū may be different for different points (x, y, t) in the video sequence. It may be assumed to be uniform only locally, within a small ST-patch P centered around each point in the video.

Space-time gradients $\nabla P_{i} = (P_{x_{i}}, P_{y_{i}}, P_{t_{i}})$ of the intensity at each pixel i (i=1 . . . n) within the ST-patch P may all be pointing to directions of maximum change of intensity in space-time, as shown in detail 34. Namely, these gradients may all be perpendicular to the direction ū of intensity lines 33, which may be expressed as the following linear equation:

$\nabla P_{i}\begin{bmatrix}u \\ v \\ w\end{bmatrix} = 0$

Different space-time gradients of different pixels in P (e.g., $\nabla P_{i}$ and $\nabla P_{j}$) may not necessarily be parallel to each other, but they may all reside in a single 2-dimensional plane in the space-time volume, such as 2-dimensional plane 36 shown in FIG. 4, which is perpendicular to direction ū.

From the equation describing the perpendicularity of the gradients $\nabla P_{i} = (P_{x_{i}}, P_{y_{i}}, P_{t_{i}})$ to the direction ū of the brush of intensity lines 33, Applicants have derived, as illustrated in FIGS. 5A, 5B and 5C, reference to which is now made, a mathematical method to measure consistency between ST-patches P_T and P_S (step 22). As shown in FIG. 5A, Eq. 1 refers to the equation describing the perpendicularity of the gradients $\nabla P_{i}$ to the direction ū of intensity lines 33.

Because there are many different gradients $\nabla P_{i}$, all of which are perpendicular to the single direction ū, the next equation, Eq. 2, may be obtained by stacking the equations from all n pixels within the small ST-patch P, yielding:

$\underbrace{\begin{bmatrix}P_{x_{1}} & P_{y_{1}} & P_{t_{1}} \\ P_{x_{2}} & P_{y_{2}} & P_{t_{2}} \\ & \vdots & \\ P_{x_{n}} & P_{y_{n}} & P_{t_{n}}\end{bmatrix}}_{G\;(n \times 3)} \begin{bmatrix}u \\ v \\ w\end{bmatrix} = \begin{bmatrix}0 \\ 0 \\ \vdots \\ 0\end{bmatrix}_{n \times 1}$

where n is the number of pixels in P (e.g., if P is 7×7×3, then n=147). In other words, gradients $\nabla P_{i}$ are embedded in a 2-dimensional plane, where direction ū is perpendicular to that plane. When G is used to denote the matrix of gradients $\nabla P_{i}$, and ū is used to denote the direction, Eq. 2 may be simplified as:

Gū=0

This simplified equation is referred to as Eq. 3 in FIG. 5A.
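
By way of illustration only, the stacked gradient matrix G and its gram matrix (introduced as Eq. 7 below) may be computed along the following lines. This is a minimal NumPy sketch under stated assumptions (an ST-patch stored as a (rows, columns, frames) intensity array, central-difference gradients); the function name and gradient scheme are illustrative, not part of the patent text.

```python
import numpy as np

def gram_matrix(patch):
    """Build the n x 3 gradient matrix G of Eq. 2 for one small ST-patch
    and return its 3x3 gram matrix M = G^T G (the single-patch analogue
    of Eq. 7). `patch` is a (rows, cols, frames) intensity array."""
    # Central-difference space-time gradients at every pixel of the patch;
    # np.gradient differentiates along (rows, cols, frames) = (y, x, t).
    Py, Px, Pt = np.gradient(patch.astype(np.float64))
    # Stack the per-pixel gradients into the n x 3 matrix G
    # (n = 147 for a 7x7x3 patch).
    G = np.stack([Px.ravel(), Py.ravel(), Pt.ravel()], axis=1)
    return G.T @ G
```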

Applicants have realized that two ST-patches P₁ and P₂ may be considered motion consistent if there exists a common vector ū=[u, v, w] which satisfies the equations:

G₁ū=0 and G₂ū=0

which is referred to as Eq. 4 in FIG. 5A. Diagram 38 in FIG. 5A illustrates how the fulfillment of this condition indicates motion consistent patches. Two two-dimensional planes 39A and 39B, which are defined by the gradients $\nabla P_{i}$ of the local intensity lines in ST-patches P₁ and P₂, are shown, as are their associated vectors ū₁ and ū₂. If vectors ū₁ and ū₂ are the same, then there is consistency between those local intensity lines; that is, the motions which induced those intensity lines are consistent.

It will be appreciated that, in the prior art, it was necessary to solve the equations in Eq. 4 in order to compute the optical flow in ST-patches P₁ and P₂. However, it is known in the art that these equations may not be well-defined and that their solution is difficult. Applicants have realized that it is sufficient to determine whether a solution ū exists and, if it does, then the two patches may be considered motion consistent.

As shown in Eq. 5 in FIG. 5B, Eq. 4 may be rewritten such that matrices G₁ and G₂ are stacked into one matrix, yielding:

$\begin{bmatrix}G_{1} \\ G_{2}\end{bmatrix} \cdot \bar{u} = 0$

The meaning of this equation is that the vector ū should be perpendicular to both planes spanned by the gradients of the two patches.

The next equation, as shown in Eq. 6 in FIG. 5B, is derived from Eq. 5 by multiplying it by the [G₁ G₂] transpose. The left matrix is then denoted M₁₂:

$\underbrace{\begin{bmatrix}G_{1}^{T} & G_{2}^{T}\end{bmatrix}\begin{bmatrix}G_{1} \\ G_{2}\end{bmatrix}}_{M_{12}} \cdot \bar{u} = 0$

where M₁₂ is the “gram matrix” of [G₁ G₂]. This matrix is symmetric and positive semi-definite, and has the following structure:

$M_{12} = \begin{bmatrix}\sum P_{x}^{2} & \sum P_{x}P_{y} & \sum P_{x}P_{t} \\ \sum P_{y}P_{x} & \sum P_{y}^{2} & \sum P_{y}P_{t} \\ \sum P_{t}P_{x} & \sum P_{t}P_{y} & \sum P_{t}^{2}\end{bmatrix}$

The gram matrix M₁₂ is shown as Eq. 7 in FIG. 5B. Eq. 8 in FIG. 5A shows the following equation, which is Eq. 6 simplified as:

M₁₂·ū=0

Applicants have realized that the rank of matrix M₁₂ may be used to indicate whether the two patches P₁ and P₂ are motion consistent, as follows: if the two individual matrices M₁ and M₂ are of rank 2, then their space-time gradients span two-dimensional planes, such as the planes 39A and 39B, respectively. For each plane 39, there exists a normal vector ū that is perpendicular to all of the space-time gradients in its associated patch P. If patches P₁ and P₂ are motion consistent, then the two associated planes 39A and 39B, respectively, are coplanar (and coincide) and the rank of the joint matrix M₁₂ is 2. If, however, the rank of matrix M₁₂ is 3, then the planes are not coplanar and patches P₁ and P₂ are not motion consistent.

Mathematically defined, when there exists at least one vector ū that is normal to all gradients in the two patches, that is, when the two patches are motion consistent, then M₁₂ is a 3×3 rank deficient matrix: rank(M₁₂)≦2.

Applicants have also realized that if M₁₂ is not rank-deficient (i.e., rank(M₁₂)=3 ⇔ λ_min(M₁₂)≠0), then the two patches cannot be motion consistent.

It will be appreciated that the rank-based constraint method described hereinabove to determine whether two ST-patches are motion consistent is based only on the intensities of the two patches, and avoids explicit motion estimation, which is often very difficult in complex dynamic scenes. Moreover, the two ST-patches need not be of the same size in order to measure their motion consistency.

Applicants have further realized that the rank-3 constraint on M₁₂ is a sufficient but not a necessary condition. Namely, if rank(M₁₂)=3, then there is no single image motion which can induce the intensity pattern of both ST-patches P₁ and P₂, and therefore they are not motion-consistent. However, there may be cases in which there is no single motion which can induce the two space-time intensity patterns P₁ and P₂, yet rank(M₁₂)<3. This can happen, for example, when each of the two space-time patches contains only a degenerate image structure (e.g., an image edge) moving in a uniform motion. In this case, the space-time gradients of each ST-patch will reside on a line in the space-time volume, all possible vectors ū will span a 2D plane in the space-time volume, and therefore rank(M₁)=1 and rank(M₂)=1. Since M₁₂=M₁+M₂, it follows that rank(M₁₂)≦2<3, regardless of whether or not there is motion consistency between P₁ and P₂.

Applicants have realized that the only case in which the rank-3 constraint on M₁₂ is both sufficient and necessary for detecting motion inconsistencies is when both matrices M₁ and M₂ are each of a rank smaller than 3 (assuming each ST-patch contains a single motion).

Applicants have therefore generalized the notion of the rank constraint on M₁₂ to obtain a sufficient and necessary motion-consistency constraint for both degenerate and non-degenerate ST-patches.

Applicants have realized that all possible ranks of the matrix M of an individual ST-patch P which contains a single uniform motion are as follows: rank(M)=2 when P contains a corner-like image feature, rank(M)=1 when P contains an edge-like image feature, and rank(M)=0 when P contains a uniformly colored image region.

This information regarding the spatial properties of P is captured in the 2×2 upper-left minor M^⋄ of the matrix M (Eq. 7, FIG. 5B):

$M^{\diamond} = \begin{bmatrix}\sum P_{x}^{2} & \sum P_{x}P_{y} \\ \sum P_{y}P_{x} & \sum P_{y}^{2}\end{bmatrix}$

This is very similar to the matrix used by Harris et al. (C. Harris and M. Stephens. A combined corner and edge detector. In 4th Alvey Vision Conference, pages 147-151, 1988), but the summation here is over a 3-dimensional space-time patch, and not a 2-dimensional image patch.

In other words, for an ST-patch with a single uniform motion, the following rank condition holds: rank(M)=rank(M^⋄). Namely, when there is a single uniform motion within the ST-patch, the added temporal component (which is captured by the third row and third column of M) does not introduce any increase in rank.

However, when an ST-patch contains more than one motion, i.e., when the motion is not along a single straight line, the added temporal component may introduce an increase in the rank, namely: rank(M)=rank(M^⋄)+1. It will be appreciated that the difference in rank cannot be more than 1, because only one column/row is added in the transition from M^⋄ to M.

Applicants have accordingly derived Eq. 9, as shown in FIG. 5C, as follows, in which the rank-increase Δr between M and its 2×2 upper-left minor M^⋄ is measured, for determining whether one ST-patch contains a single motion or multiple motions:

${\Delta \; r} = {{{{rank}\; (M)} - {{rank}\; ( M^{♦} )}} = \{ \begin{matrix}0 & {{single}\mspace{14mu} {motion}} \\1 & {{multiple}\mspace{14mu} {motions}}\end{matrix} }$

It will be appreciated that this is a generalization of the rank-3 constraint on M described hereinabove. (When the rank of M is 3, then the rank of its 2×2 minor is 2, in which case the rank-increase is 1.) This generalized constraint is valid for both degenerate and non-degenerate ST-patches.

Following the same reasoning for two different ST-patches, Applicants have derived a sufficient and necessary condition for detecting motion inconsistency between two ST-patches, as shown in FIG. 5C (Eq. 10) and as follows:

${\Delta \; r} = {{{{rank}\; ( M_{12} )} - {{rank}\; ( M_{12}^{♦} )}} = \{ \begin{matrix}0 & {consistent} \\1 & {inconsistent}\end{matrix} }$

That is, for two ST-patches P₁ and P₂, the rank-increase Δr between M₁₂ and its 2×2 upper-left minor M₁₂^⋄ may be measured to determine whether the two ST-patches are motion-consistent with each other.

It will be appreciated that Eq. 10 is a generalization of the rank-3 constraint on M₁₂ presented hereinabove with respect to Eq. 8. This generalized constraint is valid for both degenerate and non-degenerate ST-patches.

One approach to estimate the rank-increase from M^⋄ to M is to compute their individual ranks, and then take the difference, which provides a binary value (0 or 1). The rank of a matrix is determined by the number of non-zero eigenvalues it has or, more practically, by the number of eigenvalues above some threshold.

However, due to noise, eigenvalues are never zero. Applying a threshold to the eigenvalues is usually data-dependent, and a wrong choice of a threshold would lead to wrong rank values. Moreover, the notion of motion consistency between two ST-patches (which is based on the rank-increase) is often not binary. That is, it is not straightforward whether two motions which are very similar but not identical are consistent or not. Applicants have therefore developed the notion of a continuous measure of motion consistency between two ST-patches.

The eigenvalues of the 3×3 matrix M may be denoted λ₁≧λ₂≧λ₃, and the eigenvalues of its 2×2 upper-left minor M^⋄ may be denoted λ₁^⋄≧λ₂^⋄. From the Interlacing Property of eigenvalues in symmetric matrices, as taught by Golub et al. (G. Golub and C. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1996, p. 396), it follows that:

λ₁ ≧ λ₁^⋄ ≧ λ₂ ≧ λ₂^⋄ ≧ λ₃,

leading to the following two observations, denoted Eq. 11 in FIG. 5C:

$\lambda_{1} \geq \frac{\lambda_{1} \cdot \lambda_{2} \cdot \lambda_{3}}{\lambda_{1}^{\diamond} \cdot \lambda_{2}^{\diamond}} = \frac{\det(M)}{\det(M^{\diamond})} \geq \lambda_{3}, \quad \text{and} \quad 1 \geq \frac{\lambda_{2} \cdot \lambda_{3}}{\lambda_{1}^{\diamond} \cdot \lambda_{2}^{\diamond}} \geq \frac{\lambda_{3}}{\lambda_{1}} \geq 0.$

The continuous rank-increase measure Δr may then be defined as follows (Eq. 12):

${\Delta \; r} = \frac{\lambda_{2} \cdot \lambda_{3}}{\lambda_{1}^{♦} \cdot \lambda_{2}^{♦}}$

By Eq. 11, 0≦Δr≦1. The case of Δr=0 is the ideal case of no rank increase, and when Δr=1 there is a clear rank increase. The above continuous definition of Δr may allow analysis of noisy data (without applying any threshold), and may provide varying degrees of rank-increase for varying degrees of motion-consistency.
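
As an illustration, the continuous measure of Eq. 12 may be computed with a few lines of NumPy. This is a minimal sketch assuming a well-conditioned (non-uniform) patch; for a nearly uniform patch all eigenvalues approach zero, and the stabilized approximation discussed further below would be preferred.

```python
import numpy as np

def rank_increase(M):
    """Continuous rank-increase measure of Eq. 12:
    delta_r = (lambda2 * lambda3) / (lambda1_minor * lambda2_minor)."""
    lam = np.linalg.eigvalsh(M)[::-1]           # eigenvalues of M, descending
    lam_minor = np.linalg.eigvalsh(M[:2, :2])[::-1]  # eigenvalues of the 2x2 minor
    return (lam[1] * lam[2]) / (lam_minor[0] * lam_minor[1])
```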

A space-time video template T may consist of many small ST-patches. In accordance with a preferred embodiment of the present invention, it may be correlated against a larger video sequence by checking its consistency with every video segment centered around every space-time point (x, y, t) in the large video. Applicants have realized that a good match between the video template T and a video segment S may satisfy two conditions:

1: that as many corresponding ST-patches as possible between T and S be brought into “motion-consistent alignment”; and

2: that the alignment between motion discontinuities within the template T and motion discontinuities within the video segment S be maximized. Such discontinuities may also result from space-time corners and very fast motion.

Applicants have further realized that a good global template match should also minimize the number of local inconsistent matches between regular patches (patches not containing motion discontinuity), and should also minimize the number of matches between regular patches in one sequence with motion discontinuity patches in the other sequence.

Applicants have therefore developed the following measure to capture the degree of local inconsistency between a small ST-patch P₁∈T and an ST-patch P₂∈S, in accordance with conditions 1 and 2:

$m_{12} = \frac{\Delta r_{12}}{\min(\Delta r_{1}, \Delta r_{2}) + \varepsilon}$

where ε avoids division by 0. This measure may yield low values (i.e., ‘consistency’) when P₁ and P₂ are motion consistent with each other (in which case, Δr₁₂≈Δr₁≈Δr₂≈0). It may also provide low values when both P₁ and P₂ are patches located at motion discontinuities within their own sequences (in which case, Δr₁₂≈Δr₁≈Δr₂≈1). The variable m₁₂ may provide high values (i.e., ‘inconsistency’) in all other cases.

Thus, to measure consistency (step 22 of FIG. 3), the method of the present invention may determine the value of m₁₂ for every pair of ST-patches.
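
By way of illustration, the per-pair measure may be sketched as follows, reusing rank_increase() from the sketch above and the additivity M₁₂=M₁+M₂ noted hereinabove. The value of eps is an assumption; the text only requires that it avoid division by zero.

```python
def local_inconsistency(M1, M2, eps=1e-5):
    """Local inconsistency measure m12 = dr12 / (min(dr1, dr2) + eps),
    computed from the gram matrices of two corresponding ST-patches.
    Low m12: the patches are motion consistent (or both lie on motion
    discontinuities); high m12: the patches are inconsistent."""
    dr1, dr2 = rank_increase(M1), rank_increase(M2)
    dr12 = rank_increase(M1 + M2)   # gram matrix of the union: M12 = M1 + M2
    return dr12 / (min(dr1, dr2) + eps)
```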

To obtain a global inconsistency measure between the template T and a video segment S (step 24), the average value of m₁₂ in T may be computed:

$\frac{1}{N}\sum m_{12}(x, y, t),$

where N is the number of space-time points (and therefore also the number of ST-patches) in T. Similarly, a global consistency measure between the template T and a video segment S may be computed as the average value of

$\frac{1}{m_{12}},$

i.e.:

${C( {T,S} )} = {\frac{1}{N}{\sum{\frac{1}{m_{12}( {x,y,t} )}.}}}$

The summation may be a regular one or a weighted summation:

${C( {T,S} )} = \frac{\sum{\frac{1}{m_{12}( {x,y,t} )}*{w( {x,y,t} )}}}{\sum{w( {x,y,t} )}}$

where w(x, y, t) may be a set of weights that may depend, for example, on the amount of space-time gradients in the patch, and that may penalize smooth patches, which are, in principle, consistent with any other patch and therefore non-informative.
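
The aggregation step (step 24) may be illustrated by the following minimal sketch, covering both the plain average of 1/m₁₂ and the weighted form; the array handling is an assumption, and the choice of weights is application dependent, as discussed next.

```python
import numpy as np

def global_consistency(m12_values, weights=None):
    """Aggregate the per-patch m12(x, y, t) values of a template into one
    behavior-based correlation score C(T, S), per the equations above."""
    consistency = 1.0 / np.asarray(m12_values, dtype=np.float64)
    if weights is None:
        return consistency.mean()               # plain average of 1/m12 over N patches
    w = np.asarray(weights, dtype=np.float64)
    return (consistency * w).sum() / w.sum()    # weighted summation with w(x, y, t)
```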

An alternative possibility may be to give more weight to the informative corner-like patches than to the less informative edge-like (degenerate) patches. In some applications, dynamic regions may be more important than static ones and, therefore, larger weights may be assigned to patches with large temporal gradients of normal-flow values. If figure-background segmentation is desired, then a larger weight may possibly be given to foreground patches than to background ones. The weights may also be pre-defined by a user who may choose to give more weight to the more important regions in the template. In all cases, the weights may have continuous values or they may be a binary mask (“soft” vs. “hard” weighting).

A space-time template T (e.g., of size 30×30 pixels×30 frames) may thus be correlated (step 26) against a larger video sequence S by sliding the template T around sequence S in all three dimensions (x, y and t), while computing its consistency with the video segment at every video location. (To allow flexibility for small changes in scale and orientation, template T and video sequence S may be correlated at a portion, such as half, of their original resolution.) A space-time correlation volume may thus be generated. Peaks within this correlation volume may be identified (step 28) as locations in the large video sequence where behavior similar to that demonstrated in the template may be detected. Examples of such correlation peaks can be found in FIGS. 6, 7, and 8, reference to which is now made.

FIG. 6 shows the results of employing the method of the present invention to detect instances of people walking in a video recorded on a beach. The space-time template T₁ was a very short video clip (14 frames of 60×70 pixels) of a man walking. A series of a few sample frames of template T₁ is shown in FIG. 6. Template T₁ was then correlated against a long (460 frames of 180×360 pixels) video V₁ recorded on a beach, in order to detect instances of people walking in video V₁. A series of a few sample frames from video V₁ is shown in FIG. 6. In order to detect instances of people walking in both directions in video V₁, template T₁ was correlated twice with V₁. That is, both T₁ and the mirror image of T₁ were correlated with V₁.

Finally, the results of the correlation may be seen in FIG. 6. The series of frames denoted by reference numeral (C+V)₁ shows the peaks of the resulting space-time correlation volume C₁(x, y, t) superimposed on V₁. All walking people, despite being of different shapes and sizes, dressed differently, and walking against a different background than the walking person in template T₁, were detected, as indicated by the bright patches highlighting the walking figures in frame series (C+V)₁ in FIG. 6.

FIG. 7 shows an analysis of video footage of a ballet. In this example, the space-time template T₂ contains a single turn of a male dancer (13 frames of 90×110 pixels). A series of a few sample frames of template T₂ is shown in FIG. 7. A series of a few sample frames of the longer (284 frames of 144×192 pixels) ballet clip V₂ against which template T₂ was then correlated is also shown in FIG. 7.

The series of frames denoted by reference numeral (C+V)₂ shows the peaks of the resulting space-time correlation volume C₂(x, y, t) superimposed on V₂. Most of the turns of the two dancers in V₂ (one male and one female) were detected, as indicated by the bright patches highlighting the turning figures in frame series (C+V)₂, despite the variability in scale relative to the male dancer in template T₂ (up to 20%). It will also be appreciated that this example contains very fast moving parts (frame-to-frame).

FIG. 8 shows an example in which five different activities occurring simultaneously are detected. In this example, there are five space-time templates, T₃₋₁ through T₃₋₅, in each of which one of the five activities ‘walk’, ‘wave’, ‘clap’, ‘jump’ and ‘fountain’ is demonstrated. The motion in ‘fountain’ is flowing water. A series of a few sample frames of video V₃ against which templates T₃₋₁ through T₃₋₅ were correlated is shown in FIG. 8.

The series of frames denoted by reference numeral (C+V)₃ shows the peaks of the resulting space-time correlation volume C₃(x, y, t) superimposed on V₃. As indicated by the bright patches highlighting the instances of the five template activities in frame series (C+V)₃, all the activities, including the flowing water, were correctly detected, despite the different people performing the activities and the different backgrounds.

It will be appreciated that, in regular image correlation, the search space is 2-dimensional (the entire image). In space-time correlation, in accordance with the present invention, the search space is 3-dimensional (the entire video sequence), and the local computations are more complex (e.g., eigenvalue estimations). Applicants have realized that special care must be taken regarding computational issues and have made the following observations in order to speed up the space-time correlation process significantly:

Firstly, the local 3×3 matrices M may be computed and stored ahead of time for all pixels of all video sequences in the database, and separately for the space-time templates (the video queries). The only matrices which require online estimation during the space-time correlation process may be the combined matrices M₁₂ (Eq. 7), which result from comparing ST-patches in the template with ST-patches in a database sequence. This, however, may not require any new gradient estimation during run-time, since M₁₂=M₁+M₂.
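
A minimal sketch of this precomputation follows, assuming a (H, W, T) grayscale video array. The nested loops are for clarity only; a production implementation would replace them with separable box filters over the six gradient products.

```python
import numpy as np

def precompute_gram_matrices(video, patch=(7, 7, 3)):
    """Precompute the 3x3 gram matrix M of the ST-patch centered at every
    interior pixel of a (H, W, T) video, ahead of correlation time."""
    Py, Px, Pt = np.gradient(video.astype(np.float64))
    prods = np.stack([Px * Px, Px * Py, Px * Pt,
                      Py * Py, Py * Pt, Pt * Pt])   # six unique entries of M
    ry, rx, rt = patch[0] // 2, patch[1] // 2, patch[2] // 2
    H, W, T = video.shape
    M = np.zeros((H, W, T, 3, 3))
    for y in range(ry, H - ry):
        for x in range(rx, W - rx):
            for t in range(rt, T - rt):
                s = prods[:, y - ry:y + ry + 1,
                          x - rx:x + rx + 1,
                          t - rt:t + rt + 1].sum(axis=(1, 2, 3))
                M[y, x, t] = [[s[0], s[1], s[2]],
                              [s[1], s[3], s[4]],
                              [s[2], s[4], s[5]]]
    # At run time no new gradients are needed: M12 = M_template + M_sequence.
    return M
```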

Secondly, the rank-increase measure (Eq. 12) may be approximated in order to avoid eigenvalue computation, which is computationally expensive when applied to M₁₂ at every pixel. The rank-increase measure may be approximated in accordance with the following:

Since det(M)=λ₁·λ₂·λ₃ and det(M^⋄)=λ₁^⋄·λ₂^⋄, the rank-increase measure of Eq. 12 may be rewritten as:

${\Delta \; r} = {\frac{\lambda_{2} \cdot \lambda_{3}}{\lambda_{1}^{♦} \cdot \lambda_{2}^{♦}} = \frac{\det \; (M)}{{\det ( M^{♦} )} \cdot \lambda_{1}}}$

If $\|M\|_{F} = \sqrt{\sum M(i,j)^{2}}$ is the Frobenius norm of the matrix M, then the following relation holds between $\|M\|_{F}$ and λ₁:

$\lambda_{1} \leq \|M\|_{F} \leq \sqrt{3}\,\lambda_{1}$

The scalar $\sqrt{3}$ (≈1.7) is related to the dimension of M (3×3). The rank-increase measure Δr can therefore be approximated by:

${\Delta \; \hat{r}} = \frac{\det \; (M)}{\det \; {( M^{♦} ) \cdot {M}_{F}}}$

$\Delta\hat{r}$ requires no eigenvalue computation, is easy to compute from M, and provides the following bounds on the rank-increase measure Δr of Eq. 12:

$\Delta\hat{r} \leq \Delta r \leq \sqrt{3}\,\Delta\hat{r}$

Although less precise than Δr, $\Delta\hat{r}$ provides sufficient separation between ‘rank-increases’ and ‘no-rank-increases’. In the analytic definition of the rank-increase measure (Eq. 12), Applicants have realized that Δr attains high values in the case of a uniform patch (where all eigenvalues might be equally small). In order to overcome this situation and to add numerical stability to the measure, Applicants added to the Frobenius norm in the denominator of the $\Delta\hat{r}$ equation a small constant ε that corresponds to 10 gray-level gradients, e.g.:

${\Delta \; \hat{r}} = \frac{\det \; (M)}{\det \; {( M^{♦} ) \cdot ( {{M}_{F} + ɛ} )}}$

This approximated measure may be used to speed up the space-time correlation process provided by the present invention.
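
The stabilized approximation may be sketched as follows; note that it needs only two 2×2/3×3 determinants and a norm, no eigenvalues. The numeric default for eps here is a placeholder assumption; the text specifies a constant corresponding to 10 gray-level gradients.

```python
import numpy as np

def rank_increase_approx(M, eps=1.0):
    """Approximate, stabilized rank-increase measure per the equation above:
    dr_hat = det(M) / (det(M_minor) * (||M||_F + eps))."""
    det_M = np.linalg.det(M)
    det_minor = np.linalg.det(M[:2, :2])       # determinant of the 2x2 minor
    frob = np.linalg.norm(M, ord='fro')        # Frobenius norm of M
    return det_M / (det_minor * (frob + eps))
```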

Finally, when searching only for correlation peaks, there may be no need to compute the full “correlation volume”. Instead, the method may be sped up by a coarse-to-fine multi-grid search that exploits a smoothness property of the “correlation volume”, as follows:

Initially, the resolution of sequence S and template T is reduced a number of times and the multiple resolutions may be formed into “space-time Gaussian pyramids”. The reductions in the spatial and the temporal resolutions in each level do not have to be the same. Then, the method may perform a full search in the coarsest resolution level to find several peaks of behavior correlation above some pre-defined threshold. The locations of the peaks may be found by a common technique (e.g., “non-maximal suppression”) and their total number may be limited for further efficiency.

The locations of the peaks may be translated to the next level of higher resolution and a new search may be performed only in a small space-time neighborhood around each peak, to refine its location. Any refined peaks that do not pass a threshold in the higher level may be pruned from the list of peaks. The search process may proceed in a similar manner to the next levels until the final search in the finest resolution level yields the exact locations of the highest correlation peaks in sequence S.
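
A sketch of this coarse-to-fine search follows. Here `correlate(T, S, mask)` is a hypothetical callable returning the behavioral-correlation volume C (evaluated only where `mask` is True, or everywhere when mask is None); the uniform pyramid factor of 0.5, the threshold, and the refinement radius are all assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import maximum_filter, zoom

def coarse_to_fine_peaks(correlate, T, S, levels=3, threshold=0.8, radius=2):
    """Coarse-to-fine multi-grid search for behavior-correlation peaks."""
    # Space-time pyramids (spatial and temporal factors may differ; a
    # uniform factor of 0.5 per level is assumed here for brevity).
    pyr_T = [zoom(T, 0.5 ** i) for i in range(levels)]
    pyr_S = [zoom(S, 0.5 ** i) for i in range(levels)]

    def find_peaks(C):
        # Non-maximal suppression: keep local maxima above the threshold.
        return np.argwhere((C == maximum_filter(C, size=3)) & (C > threshold))

    peaks = find_peaks(correlate(pyr_T[-1], pyr_S[-1], None))  # full coarse search
    for lvl in range(levels - 2, -1, -1):
        mask = np.zeros(pyr_S[lvl].shape, dtype=bool)
        for p in peaks * 2:  # translate each peak up one resolution level
            mask[tuple(slice(max(c - radius, 0), c + radius + 1) for c in p)] = True
        peaks = find_peaks(correlate(pyr_T[lvl], pyr_S[lvl], mask))
    return peaks  # refined peak locations at the finest resolution
```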

Another possible speed-up relates to the number of patches in template T that are computed and that contribute to the final correlation score C(T, S), as indicated hereinabove. In this alternative embodiment, instead of taking patches around all pixels in template T (and their matching patches from video segment S), the method may take only a subset of patches that represent template T. This subset may be chosen in a sparse space-time grid of locations in template T. This grid may be distributed homogeneously/regularly (e.g. every two pixels and two frames, assuming a large overlap between neighboring patches) or inhomogeneously/irregularly by, for example, sampling pixels in regions of large gradients and/or large motion.

It will be appreciated that the method of the present invention may be utilized for other applications in addition to the action recognition and detection discussed hereinabove.

In one embodiment, the method may be utilized for automatic sequence matching and registration of two or more sequences containing the same or a similar human action or other dynamic behavior by compensating for a global space-time parametric transformation. The space-time parametric transformation between the sequences can be found (e.g. as in “Aligning Sequences and Actions by Maximizing Space-Time Correlations”, Y. Ukrainitz and M. Irani, ECCV 2006) using the behavioral correlation measure of the present invention alone or in combination with other measures. The residual non-parametric inconsistency after the parametric registration can be used as a measure for spotting differences/misalignments between actions or dynamic events. These residual differences can be used, for example, for identifying the identity of a person by the way they walk (or perform other actions).

In another embodiment, the method may be utilized for video search, indexing, or “Intelligent Fast-Forward” (i.e. moving to the next instance of the behavioral phenomenon of template T). All of these applications may involve searching a large and long video for instances of a given behavior demonstrated in a small video clip. The search may be sped up by construction of a search tree or other data structure in a pre-process step.

In a further embodiment, the method may be utilized for spatio-temporal clustering and video segmentation into sequences of similar behaviors and/or similar camera-motions. The behavior-based measure of the present invention, generated in step 22, may be used directly, or with the following extension, to construct an affinity matrix between regions/pixels in the video based on the consistency of their motions.

In this embodiment, a basic set of patches with basic motions for basic structures may be defined. For example, the basic patches might be of a structure, such as a corner, an edge, etc., moving in a set of pre-defined directions (e.g. 0°, 90°, 180°, 270°) with different velocities (e.g., 0.5, 1.0, 1.5 pixels/frame). A “basis vector” may be defined for these patches and each patch of video segment S may be compared to this vector.

The motion consistency of a given patch with these basic patches, rather than with template T, may be determined and a vector of measurements may be obtained for the given patch. The above example may provide a vector with 13 elements for the 4 directions, 3 velocities and one static patch. These “feature vectors” represent the local motion profile of the patch without computing the motion.

The motion consistency measure may then be a simple distance metric (L₁, L₂, L∞ or other) between the feature vectors of two patches.
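
This feature-vector extension may be sketched as follows, reusing local_inconsistency() from the sketch above; the construction of the basis gram matrices themselves (one per canonical motion) is left as an assumption.

```python
import numpy as np

def motion_feature_vector(M_patch, M_basis):
    """Feature vector for one ST-patch: its inconsistency m12 against each
    member of a pre-defined basis of patches with canonical motions
    (e.g., the 13-element example in the text)."""
    return np.array([local_inconsistency(M_patch, Mb) for Mb in M_basis])

def feature_distance(f1, f2, order=2):
    """Motion consistency measure as a simple L1 / L2 / L-infinity
    distance between two feature vectors (order=1, 2, or np.inf)."""
    return np.linalg.norm(f1 - f2, ord=order)
```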

With this extension of the consistency measure, the method may then also comprise any standard method for clustering or segmentation, such as “Graph-cuts” (e.g. “Fast approximate energy minimization via graph cuts” by Y. Boykov, O. Veksler and R. Zabih, PAMI 2001), “Spectral Clustering” (e.g. “On Spectral Clustering: Analysis and an Algorithm” by A. Ng, M. Jordan and Y. Weiss, NIPS 2001), “Multiscale segmentation” (e.g., “Fast Multiscale Image Segmentation” by E. Sharon, A. Brandt and R. Basri, CVPR 2000), or “Mean-Shift” (e.g., “Mean Shift: A Robust Approach Towards Feature Space Analysis” by D. Comaniciu and P. Meer, PAMI 2002). The result may be clustering or segmentation of the video sequence into regions of similar motions.

With the feature vectors, the method may also include the measurement and/or the comparison of the statistical dynamical properties of video data. In one embodiment, the method may construct histograms of the vectors in large regions with many patches. Regions with similar statistical distributions of motions may have similar histograms. Any standard distance measure on histograms (e.g., KL divergence, Chi-Square distance, Earth Mover's Distance) may then be used to compare the histograms. Such a method may be used to cluster a video sequence into different dynamical textures and stochastic motions (e.g., flowing water, foam, fire . . . ) or to compare two or more video sequences with these distributions of motions.
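
For illustration, one of the standard histogram measures named above, the Chi-Square distance, may be implemented as follows (KL divergence or the Earth Mover's Distance could be substituted); the normalization and the eps guard against empty bins are assumptions.

```python
import numpy as np

def chi_square_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two histograms of feature vectors,
    for comparing the motion statistics of two video regions."""
    h1 = h1 / (h1.sum() + eps)      # normalize each histogram to sum to 1
    h2 = h2 / (h2.sum() + eps)
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
```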

While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those of ordinary skill in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.

What is claimed is:
1. A method comprising: measuring the likelihood that two different space-time video segments could have resulted from a similar underlying motion field without computing said field.

2. The method according to claim 1 and wherein said measuring utilizes pixel values directly.

3. The method according to claim 2 and wherein said pixel values are at least one of the following: pixel intensities, pixel colors, filtered intensities, local SSD (sum of square differences) surfaces, correlation surfaces and normalized correlation surfaces.

4. The method according to claim 1 and wherein said measuring comprises determining local motion consistency between corresponding, relatively small, space-time patches of said video segments.

5. The method according to claim 4 and wherein said corresponding space-time patches are locally shifted relative to each other.

6. The method according to claim 4 and wherein said measuring comprises: comparing said space-time patches to a set of representative patches to generate feature vectors; and determining motion consistency as a function of the distance between said feature vectors of said corresponding space-time patches.

7. The method according to claim 4 and wherein said measuring comprises: comparing said space-time patches to a set of representative patches to generate feature vectors; and determining motion consistency as a function of the distance between distributions of said feature vectors extracted from space-time corresponding regions within said video segments.

8. The method according to claim 4 and wherein said determining comprises calculating whether a vector ū with a non-zero temporal component exists which is perpendicular to space-time gradients of pixel values in said corresponding space-time patches.

9. The method according to claim 8 and wherein said pixel values are at least one of the following: pixel intensities, pixel colors, filtered intensities, local SSD (sum of square differences) surfaces, correlation surfaces and normalized correlation surfaces.

10. The method according to claim 8 and wherein said determining comprises calculating a rank increase between a 2×2 upper-left minor matrix of a 3×3 gram matrix of said space-time gradients and said 3×3 gram matrix.

11. The method according to claim 10 and wherein said determining comprises calculating a continuous rank increase measure.

12. The method according to claim 11 and wherein said rank increase measure is approximate.

13. The method according to claim 4 and also comprising: computing local consistency scores as a function of said local motion consistency for a multiplicity of said space-time patches within the video segments; aggregating said local consistency scores into a global correlation score between said video segment and said video sequence; determining a correlation volume of said video sequence with respect to said video segment; and identifying peaks in said correlation volume, said peaks denoting locations in said video sequence where behavioral phenomena occur.

14. The method according to claim 13 and wherein said determining comprises: generating multiple resolutions of said video sequence and said video segment, wherein multiple resolutions are in at least one of space and time; searching a coarse resolution of said video segment within a coarse resolution of said video sequence to find locations with high match values; refining the search in a finer resolution around locations in said finer resolution corresponding to said locations in said coarse resolution; and repeating said refined search until reaching a desired resolution level.
15. The method according to claim 1 and wherein said measuring is employed for identifying locations in a video sequence where at least one behavioral phenomenon similar to that demonstrated in a video segment occurs.

16. The method according to claim 15 and wherein said at least one phenomenon is at least one of the following types of phenomena: a dynamic behavior, an action, a rigid motion, and a non-rigid motion.

17. The method according to claim 15 and wherein said at least one phenomenon is a multiplicity of phenomena occurring within a field-of-view of a camera.

18. The method according to claim 15 and wherein an entity performing said at least one phenomenon in said video segment does not have the same appearance as an entity performing a phenomenon similar to said at least one phenomenon in said video sequence.

19. A method according to claim 18 and wherein said entity is at least one of the following: a person, an animal, a rigid object, a non-rigid object.

20. The method according to claim 15 wherein said measuring is employed for at least one of the following: video search, indexing and fast forward to a next phenomenon of said behavioral phenomenon.

21. The method according to claim 15 and wherein said measuring is employed for automatic video sequence registration or alignment.

22. The method according to claim 15 and wherein said measuring is employed for action based identification of an entity.

23. A method according to claim 18 and wherein said entity is at least one of the following: a person, an animal, a rigid object, a non-rigid object.

24. The method according to claim 15 wherein said identifying is employed for spatio-temporal clustering.

25. The method according to claim 15 and wherein said identifying is employed for video segmentation into sequences of similar phenomena.
26. A method comprising: constructing a correlation volume between a video segment and a video sequence.

27. A method comprising: measuring a rank-increase between a 2×2 upper-left minor matrix of a 3×3 gram matrix and said 3×3 gram matrix as a function of the discrepancy between the eigenvalues of said gram matrix and said minor matrix.

28. The method according to claim 27 and wherein said function comprises the ratio between the product of the three eigenvalues of said gram matrix and the two eigenvalues of said minor matrix.

29. The method according to claim 27 and wherein said function comprises the difference between the sum of the three eigenvalues of said gram matrix and the two eigenvalues of said minor matrix.

30. The method according to claim 27 and wherein said function comprises calculating the following:

$\Delta r = \frac{\lambda_{2} \cdot \lambda_{3}}{\lambda_{1}^{\diamond} \cdot \lambda_{2}^{\diamond}}$

where said eigenvalues of said gram matrix are λ₁≧λ₂≧λ₃, and the eigenvalues of said minor matrix are λ₁^⋄≧λ₂^⋄.

31. The method according to claim 27 and wherein said measuring comprises calculating a continuous rank increase measure.

32. The method according to claim 31 and wherein said rank increase measure is approximate.